Learning Curves for Noisy Heterogeneous Feature-Subsampled Ridge Ensembles

Feature bagging is a well-established ensembling method which aims to reduce prediction variance by combining predictions of many estimators trained on subsets or projections of features. Here, we develop a theory of feature-bagging in noisy least-squares ridge ensembles and simplify the resulting learning curves in the special case of equicorrelated data. Using analytical learning curves, we demonstrate that subsampling shifts the double-descent peak of a linear predictor. This leads us to introduce heterogeneous feature ensembling, with estimators built on varying numbers of feature dimensions, as a computationally efficient method to mitigate double-descent. Then, we compare the performance of a feature-subsampling ensemble to a single linear predictor, describing a trade-off between noise amplification due to subsampling and noise reduction due to ensembling. Our qualitative insights carry over to linear classifiers applied to image classification tasks with realistic datasets constructed using a state-of-the-art deep learning feature map.


I. INTRODUCTION
Ensembling methods, where one combines predictions from multiple predictors to achieve a stronger prediction, are ubiquitous in machine learning practice [1]. A popular class of ensembling methods (known as attribute bagging [2] as well as the random subspace method [3]) are based on feature subsampling [2-6], where each predictor has access to only a subset of data features, are independently trained on those features, and their predictions are combined to achieve a stronger prediction. For example, the popular random forest method makes use of this strategy [3,7]. An advantage of these methods is that they allow parallel processing. For example, Feature-Distributed Machine Learning, combine independent predictions made by agents who only see subsets of available features [8].
While commonly used in practice, a theoretical understanding of ensembling via feature subsampling is not well developed. Here, we provide an analysis of this technique in the case of feature-subsampled linear ridge regression using methods from statistical physics [9][10][11][12]. This allows us to obtain analytical expressions for typical case performance of feature-subsampled linear ridge regression. Analysis of these equations under special cases reveal interesting phenomena involving effects of noise, regularization, and subsampling on prediction performance.
Our findings relate to double-descent [13,14], which results from over-fitting to noise and poses a serious problem for practical machine learning. Regularization is commonly used to mitigate double descent, however optimal regularization strength depends on data and noise levels [15,16]. Our theory reveals an alternative strategy. We observe that subsampling shifts the location of a predictor's sample-wise double-descent peak [14,16,17]. An interesting consequence of this is that if the predictors are heterogeneous in the number of features they see, they will go through double-descent at different sample-sizes. Therefore, bagging them will lead a mitigation of double-descent, as when one predictor fails, the others will compensate with accurate predictions.
In summary, we make the following original contributions: • Using the replica trick from statistical physics [9, 11], we derive the generalization error of ensembled least-squares ridge regression with random structured Gaussian data, deterministic feature maps, and a noisy linear teacher function. Our derivation allows for heterogeneity in the rank of the feature maps of the ensemble members. • We derive explicit formulas which demonstrate that subsampling alters the interpolation threshold of ridge regression.
• We demonstrate benefits of heterogeneous ensembling as a robust method for mitigating double-descent.
• We analyze the role of data correlations, readout noise, and data-task alignment in determining the optimal ensembling strategy in a tractable special case.
Related works: A substantial body of work has elucidated the behavior of linear predictors for a variety of feature maps [13,16,[18][19][20][21][22][23][24][25][26][27][28][29]. Several recent works have extended this research to characterize the behavior of ensembled regression using solvable models [25, 30,31]. Ref. [30] derives expressions for the generalization error of generalized linear models, of which ridge ensembles are a special case, in terms of the solutions to a set of self-consistent equations. However, [30] and [25] focus their analysis on the case of isotropic data and Gaussian random masks of homogeneous dimensionality. In contrast, we explicitly consider learning from correlated data by ensembles with heterogeneous readout dimensionality. Our work focuses on the effect of feature-wise subsampling. Additional recent works study the performance of ridge ensembles with example-wise subsampling [32,33] and simultaneous subsampling of features and examples [31]. These works find that subsampling behaves as an implicit regularization, and prove equivalences between optimal ensembling and optimal regularization. In a similar vein, we consider here ensembling as a safeguard against insufficient regularization. Methods from statistical physics have long been used for machine learning theory [10][11][12]. Relevant work in this domain include [34] which studied ensembling by data-subsampling in linear regression.

II. LERANING CURVES FOR ENSEMBLED RIDGE REGRESSION FROM THE REPLICA METHOD
We consider noisy ensembled ridge regression in the setting where ensemble members are trained independently on masked versions of the available features. We derive our main analytical formula for generalization error of ensembled linear regression, as well as analytical expressions for generalization error in the special case of subsampling of equicorrelated features. Later sections illustrate the implications of the derived formulas.

A. Problem Setup
Consider a training set D = {ψ µ , y µ } P µ=1 of size P . The training examplesψ µ ∈ R M are drawn from a Gaussian distribution with Gaussian feature noise: We seek to analyze the quality of predictions which are averaged over an ensemble of ridge regression models, each with access to a subset of the features. We consider k linear predictors with weightsŵ r ∈ R Nr , r = 1, . . . , k. Critically, we allow N r ̸ = N r ′ for r ̸ = r ′ , which allows us to introduce structural heterogeneity into the ensemble of predictors. A forward pass of the model is given as: The model's prediction f (ψ) is an average over k linear predictors. The "measurement matrices" A r ∈ R Nr×M act as linear masks restricting the information about the features available to each member of the ensemble. Subsampling may be implemented by choosing the rows of each A r to coincide with the rows of the identity matrix -the row indices corresponding to indices of the sampled features. The feature noise σ ∼ N (0, Σ 0 ) and the readout noises ξ r ∼ N (0, η 2 r ), are drawn independently at the execution of each forward pass of the model. Note that while the feature noise is shared across the ensemble, readout noise is drawn independently for each readout: E[ξ r ξ r ′ ] = δ rr ′ η 2 r . The weight vectors are trained separately in order to minimize a regular least-squares loss function with ridge regularization:ŵ Here {ξ µ r } represents the readout noise which is present during training, and independently drawn: ξ µ r ∼ N (0, η 2 r ), E[ξ µ r ξ ν r ] = η 2 r δ µν . As a measure of model performance, we consider the generalization error, given by the mean-squarederror (MSE) on ensemble-averaged prediction: Here, the angular brackets represent an average over the data distribution and noise: ψ ∼ N (0, Σ s ), σ ∼ N (0, Σ 0 ), ξ r ∼ N (0, η 2 r ). The generalization error depends on the particular realization of the dataset D through the learned 3) |i−j| , ζ = 0.1, η = 0.2. We set k = 3 with ν1 = 0.2, ν2 = 0.4, ν3 = 0.6. Subsets of feature neurons accessed by each readout are drawn randomly and are permitted to overlap (see inset). Circular markers show the result of numerical experiment with M = 2000 feature neurons averaged over 100 trials. Curve shows theoretical prediction, obtained by solving the saddle-point equations 11 numerically. Theory and experiment conducted with a fixed ground-truth readout w * drawn randomly from an isotropic standard Gaussian distribution (b) Numerical experiment with [Σs] ij = (0.6)δij + 0.4, [Σ0] ij = .1δij, ζ = 0.1, η = 0.1. Ground truth weights are randomly sampled in each trial as in eq. 12 with ρ = .3. We set k = 3 with ν1 = 0.1, ν2 = 0.3, ν3 = 0.5. Subsets of feature neurons accessed by each readout are are mutually exclusive (see inset). Circular markers show the result of numerical experiment with M = 5000 feature neurons averaged over 100 trials. Error bars show the standard error of the mean, and are smaller than the markers. Curve shows analytical prediction obtained in the case of equicorrelated features.
weights {ŵ * }. We may decompose the generalization error as follows: Computing the generalization error of the model is then a matter of calculating E rr ′ in the cases where r = r ′ and r ̸ = r ′ . Furthermore, in the asymptotic limit we consider, we expect that the generalization error concentrates over randomly drawn datasets D.

B. Main Result
We calculate the generalization error using the replica trick from statistical physics. The result of our calculation is stated in proposition 1. The proof is lengthy, and can be found in the SI.
Proposition 1. Consider the ensembled ridge regression problem described in Section II A. Consider the asymptotic limit where M, P, {N r } → ∞ while the ratios α = P M and ν rr = Nr M , r = 1, . . . , k remain fixed. Define the following quantities:Σ Then the average generalization error may be written as: where where the pairs of order parameters {q r ,q r } for r = 1, . . . , K, satisfy the following self-consistent saddle-point equationŝ Proof. We calculate the terms in the generalization error using the replica trick from the statistical physics of disordered systems. The full derivation may be found in the supplemental material.
We make several remarks on this result: Remark 1. This is a highly general result which applies to any selection of linear masks {A r }. However, we will focus on the case where the {A r } implement subsampling of the feature neurons.
Remark 2. Our result reduces to the results of [35] when k = 1 and η = 0, and may be obtained as a special case of [36] in this limit. In the case where all readout weights have the same dimension N r = N, r = 1, . . . , k, this result may be obtained as a special case of the results of [30]. The novelty in our derivation (and subsequent analysis) is to consider heterogeneity in the values of N r .
Remark 3. The replica trick [37] is a non-rigorous but standard heuristic in the study of disordered systems. We confirm our results in simulations.
In Figure 1a, we confirm the result of the general calculation by comparing with numerical experiments. Experimental curves are generated by running ridge regression on randomly drawn datasets with M = 2000 features and averaging over the resulting error. We use highly structured data, feature noise, label noise, and readout noise (see caption for details). Each of k = 3 readouts sees a fixed but randomly drawn subset of features. Theory curves are calculated by solving the fixed-point equations 11 numerically for the chosen Σ s , Σ 0 and {A r } k r=1 then plugging the resulting order parameters into eq. 10.

C. Equicorrelated Data
Our general result allows the freedom to tune many important parameters of the learning problem: the correlation structure of the dataset, the number of ensemble members, the scales of noise, etc. However, the derived expressions are rather opaque, as they depend on the solution to a set of in general analytically intractable self-consistent equations for the order parameters. In order to better understand the phenomena captured by these expressions, we simplify them in the tractable special case in which features of the data are equicorrelated: Proposition 2. Consider the ensembled ridge regression problem described in section II A, and the result of proposition 1. Consider the special case in which we select the following parameters: with c ∈ [0, 1], ρ ∈ [−1, 1]. A label noise scale ζ ≥ 0 and readout noise scales η r ≥ 0 are permitted. Here P ⊥ = I M − 1 N 1 M 1 ⊤ M is a projection matrix which removes the component of w * 0 which is parallel to 1 M . The measurement matrices {A r } k r=1 have rows consisting of distinct one-hot vectors so that each of the k readouts has access to a subset of N r = ν rr M features. For r ̸ = r ′ , denote by n rr ′ the number of neurons sampled by both A r and A r ′ and let ν rr ′ ≡ n rr ′ /M remain fixed as M → ∞.
In this case, we may obtain fully analytical formulas for the generalization error as follows. First define the following quantities: The terms of the decomposed generalization error may then be written: where we have defined and where the order parameters {q r ,q r } may be obtained analytically as the solution (with q r > 0) to the following quadratic system of equations: In the "ridgeless" limit where λ → 0, we may make the following simplifications: Proof. Simplifying the fixed-point equations and generalization error formulas in this special case is an exercise in linear algebra. The main tools used are the Sherman-Morrison formula [38] and the fact that the data distribution is isotropic in the features so that the form ofΣ rr andΣ rr ′ depend only on N r , N r ′ , and n rr ′ . Thus, the result depends only on the values of {ν rr ′ } and not the identities of the subsampled features. To aid in computing the necessary matrix contractions we developed a custom Mathematica package which handles block matrices of symbolic dimension, with blocks containing matrices of the form M = c 1 I + c 2 11 ⊤ . This package and the Mathematica notebook used to derive these results will be made available online (see SI) In this tractable special case, c ∈ [0, 1] is a parameter which tunes the strength of correlations between features of the data. When c = 0, the features are independent, and when c = 1 the features are always equivalent. s sets the overall scale of the features and ρ tunes the alignment of the ground truth weights with the special direction in the covariance matrix. We refer to ρ as the "task alignment", and it can be thought of as a simple proxy for the "task-model alignment" [16] or "code-task alignment" [39]. In Figure 1b, we test these results by comparing the theoretical expressions for generalization error with the results of numerical experiments, finding perfect agreement. Note that in this case, both theory and experiment are averaged over ground-truth weights as well as datasets.
D. Subsampling shifts the double-descent peak of a linear predictor.
Consider the equicorrelated data model in the isotropic limit (c = 0). Consider a single linear regressor (k = 1) which connects to a subset of N = νM features. In the ridgeless limit where regularization λ → 0, and without readout noise or feature noise (η = ω = 0), the generalization error is given by equation 17 with ν rr = ν, s = 1, η r = ω = 0 in the λ → 0 limit: Double descent can arise from two possible sources of variance: explicit label noise (ζ > 0) or implicit label noise induced by feature subsampling (ν < 1). As E g ∼ (α − ν) −1 , we see that the generalization error diverges when α = ν. The subsampling fraction ν thus controls the sample complexity α at which the double-descent peak occurs. Intuitively, this occurs because subsampling changes the number of parameters of the regression model, and thus its interpolation threshold. To demonstrate this, we plot the learning curves for subsampled linear regression on equicorrelated data in Figure 2. While at finite ridge the test error no longer diverges when α = ν, it may still display a distinctive peak.

E. Heterogeneous connectivity mitigates double-descent
The observed phenomenon of double-descent -over-fitting to noise in the training set near a model's interpolation threshold -poses a serious risk in practical machine-learning applications. Regularization is the canonical strategy employed to mitigate double descent. However, in order to achieve monotonic learning, the regularization parameter must be tuned to the structure of the task and the scale of the label noise [15] -no one choice for the regularization parameter can mitigate double descent for all tasks. Considering again the plots in Figure 2(b), we observe that at any value of α, the double-descent peak can be avoided with an acceptable choice of the subsampling fraction ν. This suggests another strategy to mitigate double descent: heterogeneous ensembling. Rather than training an ensemble of linear predictors, each with the same interpolation threshold, we may ensemble over predictors with a heterogeneous distribution of interpolation thresholds in the hopes that when one predictor fails, the other members of the ensemble compensate. In Figure 3, we demonstrate that in the absence of a sufficiently regularization, heterogeneous ensembling can mitigate double-descent. Specifically. We define two ensembling strategies: in homogeneous ensembling, each of the k readouts is connected to the same fraction ν rr = 1 k features. In heterogeneous ensembling, the number of features connected by each of the k readouts are drawn i.i.d. from a Gamma distribution with fixed mean 1/k and variance σ 2 . We denote this ν rr ∼ Γ k,σ . After they are independently drawn, subsampling fractions are re-scaled so that they sum to unity: ν rr / r ν rr ← ν rr . This ensures fair competition, wherein the total number of readout weights utilized in homogeneous and heterogeneous ensembling are equal. Equivalently, we may consider the readout fractions ν rr to be drawn from a Dirichlet distribution: (ν 1 , . . . , ν k ) ∼ Dir((σk) −2 , . . . , (σk) −2 ) [40]. These strategies for connecting readouts to the features are illustrated for k = 10 in figures 3 a.i (homogeneous) and 3 a.ii (heterogeneous). The density of the distribution Γ k,σ (ν) is plotted in figure 3b for k = 10 and varying σ. In figure S1, we apply these ideas to a classification task on the CIFAR-10 dataset. We find that in this nonlinear setting, heterogeneous ensembling prevents catastrophic over-fitting, leading to monotonic learning curves without regularization (see SI for details).
In figure 3c, we use our analytical theory of equicorrelated data (see eqs. 17) to compare the performance of homogeneous and heterogeneous ensembling with k = 10. We find that for an under-regularized predictor, (3c.i, c.ii, c.iii) heterogeneous ensembling reduces the height of the double-descent peak. At larger regularization (3c.iv, c.v, c.vi), homogeneous and heterogeneous ensembling perform similarly. We quantify the extent of double-descent through the worst-case error max α (E g (α)). We find that as σ increases, the worst-case error decreases monotonically at no cost to the asymptotic error E g (α → ∞) (see Fig. 3d,e).
F. Data correlations, readout noise, and task structure determine optimal ensemble size We now ask whether ensembling is a fruitful strategy -i.e. whether it is preferable to have a single, fully connected readout or multiple sparsely connected readouts. Intuitively, the presence of correlations between features permits subsampling, as measurements from a subset of neurons will also confer information about the state of the others. In addition, ensembling over multiple readouts can average out the readout noise. To quantify these notions, we consider the special case of ensembling over k readouts, each connecting the same fraction ν rr = ν = 1 k of features in an equicorrelated code with correlation strength c and readout noise scale η, and task alignment ρ. We set the label noise, feature noise, and overlap between readouts to zero (ζ = 0, ω = 0, ν rr ′ = 0 when r ̸ = r ′ ). In the ridgeless limit, we can then express the error as : is an effective inverse signal-to-noise ratio We use the Gamma distribution with the convention that Γ k,σ (ν) is the probability density function of the Gamma distribution with mean k −1 and variance σ 2 . Shown here for k = 10 and σ indicated by the colorbar. (c) Generalization Error Curves for Homogeneous and Heterogeneous ensembling with k = 10 and indicated values of λ and σ. Curves are calculated using analytical theory for equicorrelated data with c = 0, η = 0, ζ = 0. Solid blue is the learning curve for homogeneous subsampling. Dotted red curves show loss curve for 5 realizations of the randomly drawn subsampling fractions {νrr} k r=1 . Solid red is the learning curve for heterogeneous ensembling averaged over 100 realizations of the subsampling fractions {νrr} k r=1 drawn independently from Γ k,σ (ν). (d) Average loss curves for heterogeneous ensembling with k = 10, λ = 10 −3 , and σ indicated by the colorbar. (e) Average worst-case error and asymptotic error as a function of variance for heterogeneous ensembling. Worst-case error is calculated for each realization of the subsampling fractions as maxα Eg(α|{νrr} k r=1 ). Average worst-case error is the worst-case error averaged over realizations of the subsampling fractions. Shaded region shows standard deviation over realizations of the subsampling fractions. and F (H, k, ρ, α) is a rational function of its arguments (see SI for full expressions). Therefore, given fixed parameters s, c, ρ, α, the value k * which minimizes error depends on η, s, and c only through the ratio H.
Using our analytical theory, we plot the optimal number of readouts k in the parameter space of H and ρ (see Fig. 4a). The resulting phase diagrams are naturally divided into three regions. In the signal-dominated phase a single fully-connected readout is optimal (k * = 1). In an intermediate phase, 1 < k * < ∞ minimizes error. And in a noise-dominated phase k * = ∞. The boundary between the signal-dominated and noise-dominated phases (dotted lines in 4a) can be written H = (1 − 1 α )(1 − ρ 2 ) when α > 1 and H = α(1 − α)(1 − ρ 2 ) when α < 1 . The boundary between the intermediate and noise-dominated phases (dashed lines in 4a) can be written H = 2 − (2 + 1 α )ρ 2 . As is evident in these phase diagrams, an increase in H causes an increase in k * . This can occur because of a decrease in the signal-to-readout noise ratio s/η 2 , or through an increase in the correlation strength c. An increase in ρ also leads to an increase in k * , indicating that ensembling is more effective when there is alignment between the structure of FIG. 4. Noise level and data correlation strength determine optimal readout strategy: Using analytical theory (see eq. 17), we calculate the generalization error of linear predictors on equicorrelated data ([Σs] ij = (1 − c)δij + c, 0 < c ≤ 1) with readout noise with variance η 2 . Ground truth weights are drawn as in eq. 12. For convenience, we set λ = 0, though results are qualitatively similar with small finite ridge. We consider k readouts, each connecting a fraction ν = 1/k of the feature neurons, so that the total number of readout weights is conserved. (a) Phase diagrams of optimal k in the parameter space of task alignment ρ and the inverse effective signal-to-noise ratio H ≡ η 2 s(1−c) . Color indicates the optimal number of readouts k * , with gray indicating k * = 1 and white indicating k * = ∞ We consider (i) α = 0.25, (ii) α = 0.75, (iii) α = 1.5, (iv) α = 10 3 . Black lines are analytically derived phase boundaries between regions of parameter space where finite optimal k * exists and where k * = ∞. Dotted black lines are phase boundaries of the type where k * jumps discontinuously from 1 to ∞. Dashed black lines are phase boundaries of the type where k * → ∞ from one side and k * = ∞ on the other. (b) for three choices of the parameters (H, ρ) we plot the learning curve for ensembled linear regression for a variety of k values (see colorbar), as well as k = ∞, indicated by the dotted black line. Depending on the region of parameter space, the optimal readout strategy may be to select k * = 1, 1 < k * < ∞, or k * = ∞. the data and the task. Learning curves from each of these phases for varying k are plotted in Fig. 4b. The resulting shifts in the location of the double-descent peak resemble those observed in practice for ensembling methods applied to linear classifiers [6].

III. CONCLUSION
In this paper, we provided a theory of feature-subsampled ensembling techniques focusing on feature-subsampled linear ridge regression. Our technique was the replica method from statistical physics which led us to derive an analytical formula for the typical case generalization error in the aforementioned setting. We solved these equations for a special case which revealed many interesting phenomena.
One of these phenomena relate to double descent [13,14]. In most machine learning applications, the size of the dataset is known at the outset and suitable regularization may be determined to mitigate double descent, either by selecting a highly over-parameterized model [13] or by cross-validation techniques (see for example [19]). However, in contexts where a single network architecture is designed for an unknown task or a variety of tasks with varying structure and noise levels, heterogeneous ensembling may be used to smooth out the perils of double-descent. Our analysis of ensembling in noisy neural networks suggests that an ensembling approach may be useful in improving the stability of analog neural networks, where readout noise is a significant problem (see, for example, [41]).
Much work remains to achieve a full understanding of the interactions between data correlations, readout noise, and ensembling. In this work, we have given a thorough treatment of the convenient special case where features are equicorrelated and readouts do not overlap. Future work should analyze ensembling for codes with an arbitrary correlation structure, in which readouts access randomly chosen, potentially overlapping subsets of features. This will require to average our expressions for the generalization error over randomly drawn masks {A r }. This problem has been thoroughly studied in the case where the entries of A r are i.i.d Gaussian [30], as in the ever-popular random feature model. Recent progress on the problem of non-Gaussian projections for a single readout has been made in [42].

IV. ACKNOWLEDGEMENTS
CP and this research were supported by NSF Award DMS-2134157. BSR was also supported by the National Institutes of Health Molecular Biophysics Training Grant NIH/ NIGMS T32 GM008313. We thank Jacob Zavatone-Veth and Blake Bordelon for thoughtful discussion and comments on this manuscript. In this section, we demonstrate that the qualitative insights gained from our analysis of the linear regression task with Gaussian data carries over to a practical machine learning task. In particular, we show that ensembling with heterogeneous readout connectivity can mitigate double-descent in the CIFAR10 classification task [43] without regularization. The experimental setup is as follows: To obtain a useful feature map, we first train a deep, fully connected multi-layer perceptron (MLP) to solve the CIFAR10 classification task. The MLP has three hidden layers of size N h = 1000. We use a training set of 50,000 images x µ ∈ R N0 , N 0 = 3072 from N out = 10 classes. The targets are assigned as one-hot vectors We write this network as follows: We use a MSE loss function: The predicted class is then assigned as the class corresponding to the component ofŷ with maximum value. Training for 2000 steps with full-batch gradient descent and the Adam optimizer at a learning rate of .001, the network achieves a training accuracy of 90% and a test accuracy of ∼ %50. The learned features ψ(x) are then extracted and new readout weights are trained using the homogeneous or heterogeneous ensembling strategies with modifications for handling multiple readout classes. An ensemble of k predictions is made: for r = 1, . . . , k. Here, each A r ∈ R Nr×Nin implements subsampling of a randomly drawn subset of N r features, and w r ∈ R Nout×Nr predict the class of the input from these subsampled features. The weights W r are trained independently using a pseudoinverse rule, which is equivalent to ridge regression in the limit of zero regularization. Finally, the predictions of the ensemble of readouts are combined using a mean with a nonlinear threshold: In supplemental figure 1, we demonstrate the performance of re-learning ensembles of readout weights using the homogeneous and heterogeneous ensembling strategies. To review, in homogeneous ensembling, the subsampling fractions ν rr = Nr M are chosen as ν rr = 1/k, r = 1, . . . , k. However, in heterogeneous subsampling, the weights are drawn from a gamma distribution with mean 1/k and variance σ 2 , then re-scaled so that r ν rr = 1. In figure S1, we use σ = 1/(2k).
We that the heterogeneous ensembling approach leads to a smooth learning curve without double descent. This effect is most pronounced in the plots of "test accuracy", which is the probability of incorrect classification of a test example. While homogeneous ensembling shifts the double-descent peak toward smaller P , heterogeneous ensembling eliminated the peak. Because the data is heavily correlated, there is no cost to the prediction performance at large P .

Appendix B: Generalization error of ensembled linear regression from the replica trick
Here we use the Replica Trick from statistical physics to derive analytical expressions for E rr ′ . We treat the cases where r = r ′ and r ̸ = r ′ separately. Following a statistical mechanics approach, we calculate the average generalization error over a Gibbs measure with inverse temperature β; In the limit where β → ∞ the gibbs measure will concentrate around the weight vectorŵ r which minimizes the regularized loss function. The replica trick relies on the following identity: where ⟨·⟩ D represents an average over all quenched disorder in the system. In this case, quenched disorder -the disorder which is fixed prior to and throughout training of the weights -consists of the selected training examples along with their feature noise and label noise: D = {ψ µ , σ µ , ϵ µ } P µ=1 . The calculation proceeds by first computing the average of the replicated partition function assuming n is a positive integer. Then, in a non-rigorous but standard step, we analytically extend the result to n → 0.

Diagonal Terms
We start by calculating E rr for some fixed choice of r. Noting that the diagonal terms of the generalization error E rr only depend on the learned weights w r , and the loss function separates over the readouts, we may consider the Gibbs measure over only these weights: Next we must perform the averages over quenched disorder. We first integrate over {ψ µ , σ µ , ξ µ r , ϵ µ } P µ=1 . Noting that the scalars are Gaussian random variables (when conditioned on A r ) with mean zero and covariance: To perform this integral we re-write in terms of {H r µ } P µ=1 , where Integrating over the H r µ we get: Next we integrate over Q r and add constraints. We use the following identity: Using the Fourier representation of the delta function, we get: Inserting this identity into the replicated partition function and substituting E rr (w a r ) = Q rr aa − ζ 2 we find: In order to perform the Gaussian integral over the {w a r }, we unfold over the replica index a. We first define the following: We then have for the integral over w We can finally write the replicated partition function as: We now make the following replica-symmetric ansatz: which is well-motivated because the loss function is convex. We may then rewrite the partition function as follows: Where the effective action is written: We have Where g is evaluated at the values of q, q 0 ,q,q 0 which minimize g, in accordance with the method of steepest descent.
To complete the calculation, we need to find ∂ Jq . We may do this by examining two of the saddle-point equations: These two equations may in principle be solved for the dominant values of q andq. Letting κ = λ + q, we get: Solving this system of equations, we find

Off-Diagonal Terms
We now calculate E rr ′ for r ̸ = r ′ . We now must consider the joint Gibbs Measure over w r and w r ′ : Where sums over r 1 and r 2 run over {r, r ′ }.
In order to perform the Gaussian integral over the {w a r }, we unfold in two steps. We first define the following: Unfolding over the replica indices, we then get: Note that the dimensionality of T r1r2 varies for different choices of r 1 and r 2 . Next, we unfold over the two readouts: The integral over w then becomes: We are now ready to make a replica-symmetric ansatz. The order parameter that we wish to constrain is Q rr ′ ab . Overlaps go between the weights from different replicas of the system as well as different readouts. The scale of the overlap between two measurements depends on their overlap with each other and with the principal components of the data distribution. An ansatz which is replica-symmetric but makes no assumptions about the overlaps between different measurements is as follows: Next step is to plug the RS ansatz into the free energy and simplify. To make calculations more transparent, we re-label the paramters in the RS ansatz as follows: In order to simplify log det (λI 2n + βQ), we note that this is a symmetric 2-by-2-block matrix, where each block commutes with all other blocks. We may then use [44]'s result to simplify.
ab Q r1r2 ab = n qq +qQ + qQ + rr +rR + rR + 2 cĉ +ĉC + cĈ and where the D r1r2 matrices are defined implicitly through the following equation: The D r1r2 matrices can be expressed in terms of the G r1r2 and their inverses by the standard 2 × 2 block matrix inversion formula (see, for example, [45]). Applying this formula gives the following: Collecting these terms, we may write the replicated partition function as follows: ⟨Z n ⟩ D = exp − nM 2 g q, Q, r, R, c, C,q,Q,r,R,ĉ,Ĉ Where the effective action is written: g q, Q, r, R, c, C,q,Q,r,R,ĉ,Ĉ = (B74) − qq +qQ + qQ + rr +rR + rR + 2 cĉ +ĉC + cĈ (B76) where we have defined G ≡ G rr G rr ′ G r ′ r G r ′ r ′ We have: Where g is evaluated at the values of Q, q, R, r, C, c,Q,q,R,r,Ĉ,ĉ which minimize g, in accordance with the method of steepest descent (and thus implicitly depend on J). This gives: All that remains is to calculate the saddle-point values ofq,r,ĉ and their derivatives with respect to J at J = 0.
These 6 equations can in principle be solved for {q, r, c,q,r,ĉ} and their derivatives with respect to J. Note that when J = 0, the saddle point equations B90, B91 are solved by setting c =ĉ = 0, and in this case the remaining saddle-point equations decouple over the readouts (as expected for independently trained ensemble members) giving: For readout r: and for readout r ′ : These are equivalent to the saddle-point equations for a single readout given in equation B87, B86 as expected for independently trained readouts. It is physically sensible that c = 0 when J = 0, because at zero source, there is no term in the replicated system energy function which would distinguish the overlap between two readouts from the same replica from the overlap between two readouts in separate replicas (we expect that the total overlap between readouts is non-zero, as we may still have C > 0). Differentiating the saddle point equations with respect to J will give a 6 by 6 system of equations for the derivatives of the order parameters. With foresight, we first calculate ∂ J D Evaluated at J = 0, we have the following: Differentiating equations B86, B87, B88, B89, B90, B91 and evaluating at J, c,ĉ = 0 we get: Here, we have introduced the subsampling fractions ν rr = Nr M and the overlap fractions ν rr ′ = n rr ′ M We will average the generalization error over readout weights drawn randomly from the space perpendicular to 1 M , with an added spike along the direction of 1 M : The projection matrix may be written P ⊥ = I M − 1 N 1 M 1 ⊤ M . The two components of the ground truth weights will yield independent contributions to the generalization error in the sense that ⟨E rr ′ ⟩ = (1 − ρ 2 )E rr ′ (ρ = 0) + ρ 2 E rr ′ (ρ = 1) (C5) Calculating E rr and E rr ′ is an exercise in linear algebra which is straightforward but tedious. To assist with the tedious algebra, we wrote a Mathematica package which can handle multiplication, addition, and inversion of matrices of symbolic dimension of the specific form encountered in this problem. This form consists of block matrices, where the blocks may be written as aδ M N I M + b1 M 1 ⊤ N , where a, b are scalars and δ M N ensures that there is only a diagonal component for square blocks (when M = N ). This package is included as supplemental material to this publication.

Off-Diagonal Terms
In this section, we calculate the off-diagonal error terms E rr ′ for r ̸ = r ′ , again making use of our custom Mathematica package to simplify contractions of block matrices of the prescribed form. By the isotropic nature of the equicorrelated data model, E rr ′ can only depend on the subsampling scheme through ν rr , ν r ′ r ′ , and ν rr ′ . We can thus, without loss of generality, write: A r = I nr×nr 0 nr×n r ′ 0 nr×ns 0 nr×l 0 ns×nr 0 ns×n r ′ I ns×ns 0 ns×l ∈ R Nr×M (C18) A r ′ = 0 n r ′ ×nr I n r ′ ×n r ′ 0 n r ′ ×ns 0 n r ′ ×l 0 ns×nr 0 ns×n r ′ I ns×ns 0 ns×l ∈ R N r ′ ×M (C19) where we have defined n s to be the number of features shared between the readouts, n r = N r − n s and n r ′ = N r ′ − n s and the count of remaining features l = M − n r − n r ′ − n s .

Details of Heterogeneous Subsampling Theory
In this section, we describe the method used to calculate loss curves for heterogeneous subsampling experiments seen in main text fig. 3. In each trial, Subsampling fractions {ν 1 , . . . , ν k } are generated according to the following process: 1. Each fraction ν rr is drawn independently from a Γ distribution with mean 1 k and variance σ 2 . ν rr ∼ Γ k,σ 2. The fractions are re-scaled in order to sum to unity: ν rr → ν rr /(ν 1 + · · · + ν k ) Equivaltly, the fractions ν rr are drawn from a Dirichlet distribution [40]. Then, the loss curves are calculated from the given fractions {ν rr } using equations C16, C17, C23, C24. The dotted red lines show the loss curves for 5 single trials. The solid red lines show the average loss curves from 100 trials. Note that we have defined our own convention for the parameterization of the Γ distribution in which the inverse of the mean and the standard deviation are specified. In terms of the standard "shape" and "scale" parameters, we have: Γ k,σ ≡ Γ shape = (kσ) −2 , scale = kσ 2 (D3)